Search for: All records
Total Resources: 2
- Author / Contributor
  - Molchanov, Pavlo (2)
  - Adve, Sarita V. (1)
  - Cai, Ruisi (1)
  - Fletcher, Christopher W. (1)
  - Heinrich, Greg (1)
  - Kautz, Jan (1)
  - Keckler, Stephen W. (1)
  - Mahmoud, Abdulrahman (1)
  - Muralidharan, Saurav (1)
  - Sakr, Charbel (1)
  - Sastry Hari, Siva Kumar (1)
  - Shanbhag, Naresh (1)
  - Sullivan, Michael B. (1)
  - Tsai, Timothy (1)
  - Wang, Zhangyang (1)
  - Yin, Hongxu (1)
-
Training modern large language models (LLMs) is extremely resource-intensive, and repeatedly customizing them for deployment scenarios with limited compute and memory is impractical. This paper introduces Flextron, a network architecture and post-training model optimization framework that supports flexible model deployment. Flextron uses a nested elastic structure that adapts rapidly to user-defined latency and accuracy targets during inference without requiring additional fine-tuning. It is also input-adaptive, automatically routing tokens through sub-networks for improved efficiency and performance. The authors propose a sample-efficient training method and routing algorithms to systematically transform an already-trained LLM into a Flextron model. Evaluation on the GPT-3 and LLaMA-2 families demonstrates Flextron’s superior performance over end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes only 7.63% of the tokens used in the original pretraining.
-
Mahmoud, Abdulrahman; Sastry Hari, Siva Kumar; Fletcher, Christopher W.; Adve, Sarita V.; Sakr, Charbel; Shanbhag, Naresh; Molchanov, Pavlo; Sullivan, Michael B.; Tsai, Timothy; Keckler, Stephen W. (2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE))

Full Text Available